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Abstract 

O . 

CO , Gaussian graphical models are of great interest in statistical learning. Because the 

conditional independencies between different nodes correspond to zero entries in 
the inverse covariance matrix of the Gaussian distribution, one can learn the struc- 
ture of the graph by estimating a sparse inverse covariance matrix from sample 
' data, by solving a convex maximum likelihood problem with an £i-regularization 

C/3 \ term. In this paper, we propose a first-order method based on an alternating lin- 

O . earization technique that exploits the problem's special structure; in particular, the 

subproblems solved in each iteration have closed-form solutions. Moreover, our 
algorithm obtains an e-optimal solution in 0(1/ e) iterations. Numerical experi- 
ments on both synthetic and real data from gene association networks show that a 
f — ' practical version of this algorithm outperforms other competitive algorithms. 

o\" 
o 

1 Introduction 

In multivariate data analysis, graphical models such as Gaussian Markov Random Fields pro- 
vide a way to discover meaningful interactions among variables. Let Y = {yW, . . . , ?/")} be 
an n-dimensional random vector following an n-variate Gaussian distribution Af(^, £), and let 
G = (V, E) be a Markov network representing the conditional independence structure of J\f(/i, £). 
Specifically, the set of vertices V = {1, . . . ,n} corresponds to the set of variables in Y, and the 
edge set E contains an edge if and only if yW is conditionally dependent on y^> given all 
remaining variables; i.e., the lack of an edge between i and j denotes the conditional indepen- 
dence of j/W and yv), which corresponds to a zero entry in the inverse covariance matrix 
(lEl)- Thus learning the structure of this graphical model is equivalent to the problem of learning the 
zero-pattern of To estimate this sparse inverse covariance matrix, one can solve the following 
sparse inverse covariance selection (SICS) problem: maxxes™ + logdet(X) — (£, X) — p\\X\\o, 
where 5" + denotes the set of n x n positive definite matrices, ||X||o is the number of nonzeros in 
X, £ = jj Y%=i(Yi ~ $)(Xi — P) 1 is tne sample covariance matrix, p = - i s tne sample 

mean and Yi is the z-th random sample of Y . This problem is NP-hard in general due to the com- 
binatorial nature of the cardinality term p||X||o ([2]). To get a numerically tractable problem, one 
can replace the cardinality term ||X||o by ||X||i := £\ . \Xj,j\, the envelope of ||X||o over the set 
{X € R™ xn : llXlloo < 1} (see [3]). This results in the convex optimization problem (see e.g., 

ansa): 

min -logdet(X) + (S,X)+p||X|| 1 . (1) 
XeS ++ 

Note that ([TJ) can be rewritten as min^gsj max ||(7||oo<p — logdet X + (S + U, X), where ||J7||oo 
is the largest absolute value of the entries of U. By exchanging the order of max and min, we obtain 
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the dual problem max||[/|| oo <p minjs: e 5n + — logdetX + (E + U,X), which is equivalent to 

max {logdetT^ + n : H^-Slloo <p}. (2) 

Both the primal and dual problems have strictly convex objectives; hence, their optimal solutions 
are unique. Given a dual solution W, X = W^ 1 is primal feasible resulting in the duality gap 

gap-^&W-^+pWW-'h-n. (3) 



The primal and the dual SICS problems (Q]) and (fJI are semidefinite programming problems and can 
be solved via interior point methods (IPMs) in polynomial time. However, the per-iteration com- 
putational cost and memory requirements of an IPM are prohibitively high for the SICS problem. 
Although an approximate IPM has recently been proposed for the SICS problem [8], most of the 
methods developed for it are first-order methods. Banerjee et al. 01 proposed a block coordinate 
descent (BCD) method to solve the dual problem ©. Their method updates one row and one column 
of W in each iteration by solving a convex quadratic programming problem by an IPM. The glasso 
method of Friedman et al. [5] is based on the same BCD approach as in Q], but it solves each sub- 
problem as a LASSO problem by yet another coordinate descent (CD) method [9]. Sun et al. ifToll 
proposed solving the primal problem (fTJ by using a BCD method. They formulate the subproblem 
as a min-max problem and solve it using a prox method proposed by Nemirovski ifTHl . The SINCO 
method proposed by Scheinberg and Rish 111211 is a greedy CD method applied to the primal problem. 
All of these BCD and CD approaches lack iteration complexity bounds. They also have been shown 
to be inferior in practice to gradient based approaches. A projected gradient method for solving 
the dual problem (f2| that is considered to be state-of-the-art for SICS was proposed by Duchi et al. 



1 13]. However, there are no iteration complexity results for it either. Variants of Nesterov's method 
1 14, [ID] have been applied to solve the SICS problem. d'Aspremont et al. lfl6ll applied Nesterov's 
optimal first-order method to solve the primal problem ([TJ after smoothing the nonsmooth i\ term, 
obtaining an iteration complexity bound of 0(1/ e) for an e-optimal solution, but the implementation 
in lfl6ll was very slow and did not produce good results. Lu [ 17] solved the dual problem (f2]), which 
is a smooth problem, by Nesterov's algorithm, and improved the iteration complexity to 0(l/yfe). 
However, since the practical performance of this algorithm was not attractive, Lu gave a variant 
(VSM) of it that exhibited better performance. The iteration complexity of VSM is unknown. Yuan 
111811 proposed an alternating direction method based on an augmented Lagrangian framework (see 
the ADAL method $Q below). This method also lacks complexity results. The proximal point algo- 
rithm proposed by Wang et al. in lfl9]l requires a reformulation of the problem that increases the size 
of the problem making it impractical for solving large-scale problems. Also, there is no iteration 
complexity bound for this algorithm. The IPM in [8] also requires such a reformulation. 

Our contribution. In this paper, we propose an alternating linearization method (ALM) for solving 
the primal SICS problem. An advantage of solving the primal problem is that the l\ penalty term in 
the objective function directly promotes sparsity in the optimal inverse covariance matrix. 

Although developed independently, our method is closely related to Yuan's method [18]. Both 
methods exploit the special form of the primal problem (Q~|i by alternatingly minimizing one of 
the terms of the objective function plus an approximation to the other term. The main difference 
between the two methods is in the construction of these approximations. As we will show, our 
method has a theoretically justified interpretation and is based on an algorithmic framework with 
complexity bounds, while no complexity bound is available for Yuan's method. Also our method 
has an intuitive interpretation from a learning perspective. Extensive numerical test results on both 
synthetic data and real problems have shown that our ALM algorithm significantly outperforms 
other existing algorithms, such as the PSM algorithm proposed by Duchi et al. 111311 and the VSM 
algorithm proposed by Lu 1 17]. Note that it is shown in [13] and [ 17] that PSM and VSM outperform 
the BCD method in [7] and glasso in @]. 

Organization of the paper. In Section 2 we briefly review alternating linearization methods for 
minimizing the sum of two convex functions and establish convergence and iteration complexity 
results. We show how to use ALM to solve SICS problems and give intuition from a learning 
perspective in Section 3. Finally, we present some numerical results on both synthetic and real data 
in Section 4 and compare ALM with PSM algorithm O and VSM algorithm (T% . 
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2 Alternating Linearization Methods 



We consider here the alternating linearization method (ALM) for solving the following problem: 

min F(x) = f(x) + g(x), (4) 

where / and g are both convex functions. An effective way to solve (O is to "split" / and g by 
introducing a new variable, i.e., to rewrite (01 as 

min{/(.T) + g(y) :x-y = 0}, (5) 

and apply an alternating direction augmented Lagrangian method to it. Given a penalty parameter 
l//x, at the k-th iteration, the augmented Lagrangian method minimizes the augmented Lagrangian 
function 

£{x,y;\) := f{x) + g(y) - (\,x - y) + — \\x-y\\l, 

2/i 

with respect to x and y, i.e., it solves the subproblem 

(x k ,y k ) := argmin£(a;, 2 /;A fe ), (6) 

and updates the Lagrange multiplier A via: 

A fc+1 := X k - (x k - y k )/fi. (7) 

Since minimizing C(x, y. A) with respect to x and y jointly is usually difficult, while doing so with 
respect to x and y alternatingly can often be done efficiently, the following alternating direction 
version of the augmented Lagrangian method (ADAL) is often advocated (see, e.g., B201 12110 : 

arg mm x £(x,y k ;X k ) 

a,rgmin y £(x k+1 ,y;X k ) (8) 



y k+l 

X k+1 



If we also update A after we solve the subproblem with respect to x, we get the following symmetric 
version of the ADAL method. 



x k+1 :— argmiria; £(x, y k ; X k ) 

y k+l 

x k , +1 



— (x k+1 — y k )l fj, 



= argmiriy £(x k+1 ,y; X k+1 ) 
= X k+1 - (x k+1 -y k+1 )/fi. 



Algorithm (O has certain theoretical advantages when / and g are smooth. In this case, from the 
first-order optimality conditions for the two subproblems in (0, we have that: 

X k+1 -V/(x fc+1 ) and A£ +1 = -V. 9 (y fc+1 ). (10) 

Substituting these relations into (0, we obtain the following equivalent algorithm for solving ©, 
which we refer to as the alternating linearization minimization (ALM) algorithm. 

Algorithm 1 Alternating linearization method (ALM) for smooth problem 

Input: x° = y° 
for k = 0, 1, • ■ ■ do 

1. Solve a;^ 1 := argmin x Q g (x, y k ) = f(x)+g(y k ) + {Vg{y k ),x - y k ) + ±\\x - y k \\l; 

2. Solve y k+1 := argmin y Q f (x k+1 , y) = f{x k+1 ) + (S7f(x k+1 ),y~x k+1 ) + ^\\y- 

x k+1 \\l+g{y); 
end for 



Algorithm[T]can be viewed in the following way: at each iteration we construct a quadratic approxi- 
mation of the function g(x) at the current iterate y k and minimize the sum of this approximation and 
f(x). The approximation is based on linearizing g(x) (hence the name ALM) and adding a "prox" 
term ^-\\x — y k \\2- When /i is small enough (/i < 1/L(g), where L(g) is the Lipschitz constant for 
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Vg) this quadratic function, g(y k ) + (Vg(y k ), x — y fe ) + j^\\x — y fc ||| is an upper approximation to 
g(x), which means that the reduction in the value of F(x) achieved by minimizing Q g (x, y k ) in Step 
1 is not smaller than the reduction achieved in the value of Q g (x, y k ) itself. Similarly, in Step 2 we 
build an upper approximation to f(x) at x k+1 , f(x k+1 ) + (yf(x k+1 ),y — x k+1 ) + -^\\y - x k+1 \\\, 
and minimize the sum Q f(x k+1 , y) of it and g(y). 

Let us now assume that f(x) is in the class C 1,1 with Lipschitz constant L(f), while g(x) is simply 
convex. Then from the first-order optimality conditions for the second minimization in (O, we have 
— Ay +1 € dg(y k+1 ), the subdifferential of g(y) at y = y k+1 . Hence, replacing Vg(y k ) in the 
definition of Q g (x, y k ) by — A^ +1 in (O, we obtain the following modified version of (O. 



Algorithm 2 Alternating linearization method with skipping step 

Input: x° = y° 

for k = 0, 1, • • • do 

1. Solve x k+1 := argmin x Q(x, y k ) = f(x)+g(y k ) - (\ k ,x-y k ) + ±\\x - y k \\l; 

2. lfF(x k+1 ) > Q{x k+1 ,y k ) thenx k+1 := y k . 

3. Solve y k+1 :— argmin,, Qf(x k+1 , y); 

4. A fe+1 = Vf(x k+1 ) - {x k+1 - y k+1 )/fi. 
end for 



Algorithm |2] is identical to the symmetric ADAL algorithm (O as long as F(x k+1 ) < Q(x k+1 , y k ) 
at each iteration (and to Algorithm [T] if g(x) is in C 11 and fi < 1/ max{L(/), L(g)}). If this con- 
dition fails, then the algorithm simply sets x k+1 y k . Algorithm|2]has the following convergence 
property and iteration complexity bound. For a proof see the Appendix. 

Theorem 2.1. Assume V / is Lipschitz continuous with constant L(f). For/3/L(f) < fj, < 1/L(f) 
where < j3 < 1, Algorithm\2\satisfies 

where x* is an optimal solution of © and k n is the number of iterations until the k — thfor which 
F(x k+1 ) < Q(x k+1 , y k ). Thus Algorithm\2\produces a sequence which converges to the optimal 
solution in function value, and the number of iterations needed is 0(1/ e) for an e-optimal solution. 

If g(x) is also a smooth function in the class C 1, with Lipschitz constant L(g) <l/fi, then Theorem 
12. 11 also applies to Algorithm [T] since in this case k n — k (i.e., no "skipping" occurs). Note that the 
iteration complexity bound in Theorem l2.1l can be improved. Nesterov II 15112211 proved that one can 
obtain an optimal iteration complexity bound of 0(1/ y/e), using only first-order information. His 
acceleration technique is based on using a linear combination of previous iterates to obtain a point 
where the approximation is built. This technique has been exploited and extended by Tseng 112311 . 
Beck and Teboulle M24I1 . Goldfarb et al. [25] and many others. A similar technique can be adopted 
to derive a fast version of Algorithm [2] that has an improved complexity bound of 0(l/y/e), while 
keeping the computational effort in each iteration almost unchanged. However, we do not present 
this method here, since when applied to the SICS problem, it did not work as well as Algorithm|2] 

3 ALM for SICS 

The SICS problem 

mm F(X) = f(X) + g(X), (12) 

X£b + + 

where f(X) = — logdet(X) + (S, X) and g(X) = p\\X\\i, is of the same form as (0J|. However, 
in this case neither f(X) nor g(X) have Lipschitz continuous gradients. Moreover, f(X) is only 
defined for positive definite matrices while g(X) is defined everywhere. These properties of the 
objective function make the SICS problem especially challenging for optimization methods. Nev- 
ertheless, we can still apply (0 to solve the problem directly. Moreover, we can apply Algorithm|2] 
and obtain the complexity bound in Theorem |2.1| as follows. 
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The logdet(X) term in /(A) implicitly requires that X E S™ + and the gradient of f(X), which 
is given by — A -1 + £, is not Lipschitz continuous in Fortunately, as proved in Proposition 

3.1 in [17], the optimal solution of ( TT2T > X* >z al, where a — — . Therefore, if we define 

||£||+np 

C := {X e S n : X t fl}, the SICS problem (O can be formulated as: 

min{/(X) + g(Y) : X - Y = 0, X E C, Y E C}. (13) 

We can include constraints X E C in Step 1 and Y E C in Step 3 of Algorithm [2] Theorem 12. 11 
can then be applied as discussed in j25ll . However, a difficulty now arises when performing the 
minimization in Y. Without the constraint Y E C, only a matrix shrinkage operation is needed, 
but with this additional constraint the problem becomes harder to solve. Minimization in X with or 
without the constraint X E C is accomplished by performing an SVD. Hence the constraint can be 
easily imposed. 

Instead of imposing constraint Y E C we can obtain feasible solutions by a line search on fj,. We 
know that the constraint X y ?7 is not tight at the solution. Hence if we start the algorithm with 
X y al and restrict the step size /x to be sufficiently small then the iterates of the method will 
remain in C. 

Note however, that the bound on the Lipschitz constant of the gradient of f(X) is 1/a 2 and hence 
can be very large. It is not practical to restrict n in the algorithm to be smaller than a 2 , since ji 
determines the step size at each iteration. Hence, for a practical approach we can only claim that the 
theoretical convergence rate bound holds in only a small neighborhood of the optimal solution. We 
now present a practical version of our algorithm applied to the SICS problem. 



Algorithm 3 Alternating linearization method (ALM) for SICS 

Input: X" = Y°, ^ 
for k = 0, 1, • • • do 

0. Pick^ fe+ i < ju fc . 

1. Solve X k+1 := argminxec f(X) + g(Y k ) - (A fe ,A -Y k ) + 2^7!!^ - 

2. If g{X k+1 ) > g(Y k ) - (A*, Jf^ 1 - Y k ) + ^-±_||A fe+1 - Y k \\%, then X k+1 := Y k . 

3. Solve Y k+1 := argminy f{X k+1 ) + (V f(X k+1 ),Y — X k+1 ) + ^— \\Y - X k+1 \\ 2 F + 

4. A fc+1 = Vf(X k+1 ) - (X k+1 - Y k+1 ) / Hk+i- 
end for 



We now show how to solve the two optimization problems in Algorithmic The first-order optimality 
conditions for Step 1 in Algorithm[3j ignoring the constraint X E C are: 

V/(A) - A fc + (X - Y k )/n k+1 = 0. (14) 

Consider VDiag(<i)V^ T - the spectral decomposition of Y k + fik+i (A fe — S) and let 



7< = I * + V d i + 4yu fc+ i J /2, i = 1, . . . , n. (15) 

Since V/(X) = — A -1 + E, it is easy to verify that X k+1 := FDiag(7)F T satisfies (E). When 
the constraint X E C is imposed, the optimal solution changes to X k+1 :— yDiag(7)F T with 

7,: = max |a/2, (di + \J d\ + A/ik+ij /2| , i — 1, . . . , n. We observe that solving ( TT4"i > requires 

approximately the same effort (0(n 3 )) as is required to compute V/(A fe+1 ). Moreover, from the 
solution to (fl4l . V/(A fc+1 ) is obtained with only a negligible amount of additional effort, since 

(A fc + 1 )- 1 := T/Diag( 7 )- 1 T/ T . 

The first-order optimality conditions for Step 2 in Algorithm|3]are: 

e V/(A fe+1 ) + (Y - X k+1 )/fi k+1 + dg(Y). (16) 
Since g(Y) = p||Y||i, it is well known that the solution to ( fT6] l is given by 

Y k+1 = shrink(A fc+1 - Mfc+1 (E - (X^ 1 )" 1 ), /i fc+1 p), 
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where the "shrinkage operator" shrink(Z, p) updates each element Z^j of the matrix Z by the for- 
mula shrink(Z, p)ij = sgn(Zy) • rnax{ | .Z^ | — p, 0}. 

The 0(n 3 ) complexity of Step 1, which requires a spectral decomposition, dominates the 0(n 2 ) 
complexity of Step 2 which requires a simple shrinkage. There is no closed-form solution for the 
subproblem corresponding to Y when the constraint Y e C is imposed. Hence, we neither impose 
this constraint explicitly nor do so by a line search on p k , since in practice this degrades the perfor- 
mance of the algorithm substantially. Thus, the resulting iterates Y k may not be positive definite, 
while the iterates X k remain so. Eventually due to the convergence of Y k and X k , the Y k iterates 
become positive definite and the constraint Y E C is satisfied. 

Let us now remark on the learning based intuition behind Algorithm [3] We recall that — A fc £ 
dg(Y k ). The two steps of the algorithm can be written as 

X k+1 := a rgmm{f(X) + -!—\\X-(Y k +p k+1 A k )\\ F } (17) 
xec ^Pk+i 

and 

Y k+1 := argmin{ 5 (r) + ^—\\Y - (X k+1 - p k+1 (t - (A fe+1 ) _1 ))|||,}. (18) 

The SICS problem is trying to optimize two conflicting objectives: on the one hand it tries to find a 
covariance matrix A -1 that best fits the observed data, i.e., is as close to E as possible, and on the 
other hand it tries to obtain a sparse matrix X. The proposed algorithm address these two objectives 
in an alternating manner. Given an initial "guess" of the sparse matrix Y k we update this guess 
by a subgradient descent step of length pk+i- Y k + p k+ \A k . Recall that — A k E dg{Y k ). Then 
problem (fTTT i seeks a solution X that optimizes the first objective (best fit of the data) while adding 
a regularization term which imposes a Gaussian prior on X whose mean is the current guess for the 
sparse matrix: Y k + p k+ iA k . The solution to ( fTTI i gives us a guess for the inverse covariance X k+1 . 
We again update it by taking a gradient descent step: X k+1 — /i/-_|_i(E — (A' c+1 ) _1 ). Then problem 
(fT8l seeks a sparse solution Y while also imposing a Gaussian prior on Y whose mean is the guess 
for the inverse covariance matrix X k+1 — ^tfe+i(S — (A fc+1 ) _1 ). Hence the sequence of A^'s is 
a sequence of positive definite inverse covariance matrices that converge to a sparse matrix, while 
the sequence of Y k, s is a sequence of sparse matrices that converges to a positive definite inverse 
covariance matrix. 

An important question is how to pick /ife+i. Theory tells us that if we pick a small enough value, 
then we can obtain the complexity bounds. However, in practice this value is too small. We discuss 
the simple strategy that we use in the next section. 



4 Numerical Experiments 

In this section, we present numerical results on both synthetic and real data to demonstrate the 
efficiency of our SICS ALM algorithm. Our codes for ALM were written in MATLAB. All nu- 
merical experiments were run in MATLAB 7.3.0 on a Dell Precision 670 workstation with an Intel 
Xeon(TM) 3.4GHZ CPU and 6GB of RAM. 

Since — A fc € dg(Y k ), || A fc ||oo < p; hence E — A k is a feasible solution to the dual problem (f2]l as 
long as it is positive definite. Thus the duality gap at the fc-th iteration is given by: 

Dgap := - logdct(A' £ ) + (S, X k ) + p\\X k \\ 1 - logdet(S - A k ) - n. (19) 

We define the relative duality gap as: Rel.gap :— Dgap/(1 + \pobj\ + \dobj |), where pobj anddobj 
are respectively the objective function values of the primal problem ( fT2l at point X k , and the dual 
problem (O at E — A k . Defining dk(<fi(%)) = max{l, 4>(x k ), <fi(x k ~ 1 )}, we measure the relative 
changes of objective function value F(X) and the iterates X and Y as follows: 

_ \F(x k ) — F(x k ~ 1 )\ _ \\x k — x k ~ 1 \\ F _ ||y fe -y fe - 1 || F 

Frd - d k (\F(X)\) ' Xrd - d k {\\X\\ F ) ' Yrd - d(\\Y\\ F ) ■ 

We terminate ALM when either 

(i) Dgap < e g a p or (ii) max{Frel, Xrel, Yrel} < e re ;. (20) 
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Note that in dT9b , computing log det(X k ) is easy since the spectral decomposition of X k is already 
available (see ( PT4b and Sl5[), but computing logdet(S — A fc ) requires another expensive spectral 
decomposition. Thus, in practice, we only check d20b(i) every N gap iterations. We check d20b(ii) at 
every iteration since this is inexpensive. 

A continuation strategy for updating p is also crucial to ALM. In our experiments, we adopted the 
following update rule. After every iterations, we set /i := max{/i ■ r)^, p,}; i.e., we simply reduce 
p by a constant factor r\ p every iterations until a desired lower bound on p is achieved. 

We compare ALM (i.e., Algorithm [3] with the above stopping criteria and p updates), with the 
projected subgradient method (PSM) proposed by Duchi et al. in lfl3f1 and implemented by Mark 
Schmidt Q and the smoothing method (VSM)Q proposed by Lu in Il7ll . which are considered to be 
the state-of-the-art algorithms for solving SICS problems. The per-iteration complexity of all three 
algorithms is roughly the same; hence a comparison of the number of iterations is meaningful. The 
parameters used in PSM and VSM are set at their default values. We used the following parameter 
values in ALM: e gap = lCT 3 ,e re/ = Kr 8 ,N gap = 20, iV^ = 20, p = max{/x ?^, 10~ 6 }, ?/p = 
1/3, where [iq is the initial fi which is set according to p; specifically, in our experiments, [iq = 
100/ p, if p < 0.5, p = P if 0.5 < p < 10, and p Q = p/100 if p > 10. 

4.1 Experiments on synthetic data 

We randomly created test problems using a procedure proposed by Scheinberg and Rish in lfl2ll . 
Similar procedures were used by Wang et al. in fll9fl and Li and Toh in 180 . For a given dimension n, 
we first created a sparse matrix U G K" xn with nonzero entries equal to -1 or 1 with equal proba- 
bility. Then we computed S := (U * C/ T ) _1 as the true covariance matrix. Hence, was sparse. 
We then drew p = 5n iid vectors, Yi, . . . , Y p , from the Gaussian distribution Af(0, S) by using the 
mvnrnd function in MATLAB, and computed a sample covariance matrix £ := ~ Yli=i ^i^i ■ 
We compared ALM with PSM IU3I1 and VSM IU7I1 on these randomly created data with different 
p. The PSM code was terminated using its default stopping criteria, which included d20"l)(i) with 
e gap — 10 -3 . VSM was also terminated when Dgap < 10~ 3 . Since PSM and VSM solve the 
dual problem (0, the duality gap which is given by (O is available without any additional spectral 
decompositions. The results are shown in TableQ] All CPU times reported are in seconds. 



Table 1: Comparison of ALM, PSM and VSM on synthetic data 





ALM 


PSM 


VSM 


a 


iter | Dgap | Rel.gap | CPU 


iter | Dgap | Rel.gap | CPU 


iter | Dgap | Rel.gap | CPU 


p = 0.1 


200 


300 


8.70e-4 


1.51e-6 


13 


1682 


9.99e-4 


1.74e-6 


38 


857 


9.97e-4 


1.73e-6 


37 


500 


220 


5.55e-4 


4.10e-7 


84 


861 


9.98e-4 


7.38e-7 


205 


946 


9.98e-4 


7.38e-7 


377 


1000 


180 


9.92e-4 


3.91e-7 


433 


292 


9.91e-4 


3.91e-7 


446 


741 


9.97e-4 


3.94e-7 


1928 


1500 


199 


1.73e-3 


4.86e-7 


1405 


419 


9.76e-4 


2.74e-7 


1975 


802 


9.98e-4 


2.80e-7 


6340 


2000 


200 


6.13e-5 


1.35e-8 


3110 


349 


1.12e-3 


2.46e-7 


3759 


915 


1.00e-3 


2.20e-7 


16085 


p = 0.5 


200 


140 


9.80e-4 


1.15e-6 


6 


6106 


1.00e-3 


1.18e-6 


137 


1000 


9.99e-4 


1.18e-6 


43 


500 


100 


1.69e-4 


7.59e-8 


39 


903 


9.90e-4 


4.46e-7 


212 


1067 


9.99e-4 


4.50e-7 


425 


1000 


100 


9.28e-4 


2.12e-7 


247 


489 


9.80e-4 


2.24e-7 


749 


1039 


9.95e-4 


2.27e-7 


2709 


1500 


140 


2.17e-4 


3.39e-8 


1014 


746 


9.96e-4 


1.55e-7 


3514 


1191 


9.96e-4 


1.55e-7 


9405 


2000 


160 


4.70e-4 


5.60e-8 


2529 


613 


9.96e-4 


1.18e-7 


6519 


1640 


9.99e-4 


1.19e-7 


28779 


p = 1.0 


200 


180 


4.63e-4 


4.63e-7 


8 


7536 


1.00e-3 


1.00e-6 


171 


1296 


9.96e-4 


9.96e-7 


57 


500 


140 


4.14e-4 


1.56e-7 


55 


2099 


9.96e-4 


3.76e-7 


495 


1015 


9.97e-4 


3.76e-7 


406 


1000 


160 


3.19e-4 


6.07e-8 


394 


774 


9.83e-4 


1.87e-7 


1172 


1310 


9.97e-4 


1.90e-7 


3426 


1500 


180 


8.28e-4 


1 .07e-7 


1304 


1088 


9.88e-4 


1.27e-7 


5100 


1484 


9.96e-4 


1.28e-7 


11749 


2000 


240 


9.58e-4 


9.37e-8 


3794 


1158 


9.35e-4 


9.15e-8 


12310 


2132 


9.99e-4 


9.77e-8 


37406 



From Table Q] we see that on these randomly created SICS problems, ALM outperforms PSM and 
VSM in both accuracy and CPU time with the performance gap increasing as p increases. For 
example, for p = 1.0 and n — 2000, ALM achieves Dgap = 9.58e — 4 in about 1 hour and 15 
minutes, while PSM and VSM need about 3 hours and 25 minutes and 10 hours and 23 minutes, 
respectively, to achieve similar accuracy. 

'The MATLAB can be downloaded from http://www.cs.ubc.ca/~schmidtm/Software/PQN.html 
2 The MATLAB code can be downloaded from http://www.math.sfu.ca/~zhaosong 
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4.2 Experiments on real data 



We tested ALM on real data from gene expression networks using the five data sets from (U provided 
to us by Kim-Chuan Toh: (1) Lymph node status; (2) Estrogen receptor; (3) Arabidopsis thaliana; 
(4) Leukemia; (5) Hereditary breast cancer. See fl8|] and references therein for the descriptions of 
these data sets. Table |2]presents our test results. As suggested in Q§], we set p = 0.5. From Tablef2] 
we see that ALM is much faster and provided more accurate solutions than PSM and VSM. 



Table 2: Comparison of ALM, PSM and VSM on real data 







ALM 


PSM 


VSM 


prob. 


n 


iter 


Dgap 


Rel.gap 


CPU 


iter 


Dgap 


Rel.gap 


CPU 


iter 


Dgap 


Rel.gap 


CPU 


(i) 


587 


60 


9.41e-6 


5.78e-9 


35 


178 


9.22e-4 


5.67e-7 


64 


467 


9.78e-4 


6.01e-7 


273 


(2) 


692 


80 


6.13e-5 


3.32e-8 


73 


969 


9.94e-4 


5.38e-7 


531 


953 


9.52e-4 


5.16e-7 


884 


(3) 


834 


100 


7.26e-5 


3.27e-8 


150 


723 


1.00e-3 


4.50E-7 


662 


1097 


7.31e-4 


3.30e-7 


1668 


(4) 


1255 


120 


6.69e-4 


1.97e-7 


549 


1405 


9.89e-4 


2.91e-7 


4041 


1740 


9.36e-4 


2.76e-7 


8568 


(5) 


1869 


160 


5.59e-4 


1.18e-7 


2158 


1639 


9.96e-4 


2.10e-7 


14505 


3587 


9.93e-4 


2.09e-7 


52978 



4.3 Solution Sparsity 

In this section, we compare the sparsity patterns of the solutions produced by ALM, PSM and VSM. 
For ALM, the sparsity of the solution is given by the sparsity of Y. Since PSM and VSM solve 
the dual problem, the primal solution X, obtained by inverting the dual solution W, is never sparse 
due to floating point errors. Thus it is not fair to measure the sparsity of X or a truncated version 
of X. Instead, we measure the sparsity of solutions produced by PSM and VSM by appealing 
to complementary slackness. Specifically, the (i,j)-th element of the inverse covariance matrix 
is deemed to be nonzero if and only if \Wij — Sy| = p. We give results for a random problem 
(n = 500) and the first real data set in Table [3] For each value of p, the first three rows show 
the number of nonzeros in the solution and the last three rows show the number of entries that are 
nonzero in the solution produced by one of the methods but are zero in the solution produced by 
the other method. The sparsity of the ground truth inverse covariance matrix of the synthetic data 
is 6.76%. From Table [3] we can see that when p is relatively large (p > 0.5), all three algorithms 



Table 3: Comparison of sparsity of solutions produced by ALM, PSM and VSM 



p 


100 


50 


10 


5 


1 


0.5 


0.1 


0.05 


0.01 


synthetic problem data 


ALM 


700 


2810 


11844 


15324 


28758 


37510 


63000 


75566 


106882 


PSM 


700 


2810 


11844 


15324 


28758 


37510 


63000 


75566 


106870 


VSM 


700 


2810 


11844 


15324 


28758 


37510 


63000 


75568 


106876 


ALM vs PSM 























2 


14 


PSM vs VSM 


























8 


VSM vs ALM 























2 


2 


real problem data 


ALM 


587 


587 


587 


587 


587 


4617 


37613 


65959 


142053 


PSM 


587 


587 


587 


587 


587 


4617 


37615 


65957 


142051 


VSM 


587 


587 


587 


587 


587 


4617 


37613 


65959 


142051 


ALM vs PSM 























2 


2 


PSM vs VSM 




















2 








VSM vs ALM 






























produce solutions with exactly the same sparsity patterns. Only when p is very small, are there slight 
differences. We note that the ROC curves depicting the trade-off between the number of true positive 
elements recovered versus the number of false positive elements as a function of the regularization 
parameter p are also almost identical for the three methods. 
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5 Appendix 

We show in the following that the iteration complexity of AlgorithmEJis 0(1/ e) to get an e-optimal 
solution. First, we need the following definitions and a lemma which is a generalization of Lemma 
2.3 in [24]. Let # : R n -> K and $ : E n — > R be convex functions and define 

Q^,(u,v) := <j){u) +ij)(v) + (^(v),u-v) + ^-\\u-v\\j, 

where 7^(w) is any subgradient in the subdifferential dtp(v) of ip(v) at the point v, and 

Pib(v) := argmmQ,p(u,v). (21) 

u 

Lemma 5.1. Let $(•) = (/)(■) + ip(-). For any v, if 

*(PV-H) < Q^{Pi>(v),v), (22) 

then for any u, 

2^{u) - *(p^(t>))) > |M«) - u|| 2 - ||v - u\\ 2 . (23) 



Proof. From d22t . we have 

$(u) - $Cfty(u)) > $(u)-Q</>bv»> w ) 

= $(u)-(^(«))+^«) + <7^(«),J>«(«)-«) + ^l|P^(«)-«lla)- ^ 
Now since and ip are convex we have 

<j>(u) > (f>(pi/,(v)) + (u-p^(u),7^(P^(v))>) ( 25 ) 

^(<u) > + (u - u,7v,(w)>, (26) 

where 7^(-) is a subgradient of (/>(■) and J^(p^(v)) satisfies the first-order optimality conditions for 
(E), i.e., 

7<a(pv( u )) + 7V>( u ) + - (Pip( v ) ~v)=0. (27) 
Summing (l25l ) and (l26b yields 

$(w) > + ( w -£V>( w )>74>(Pv( w ))) + V>( u ) + (w- u,7v;(w))- (28) 

Therefore, from EUl, (E3 and (ED it follows that 

- $(pv(«)) > (7V» +70(^(«)),«-^(w)) - ^-Ibt/.(w)-«ll2 

= -«), w -P>/») ~ ^lbv><» ~ U ll2 

□ 

Proof of Theorem \2.1\ Let I be the set of all iteration indices until k — 1-st for which no skipping 

occurs and let I c be its complement. Let I = {rii}, i = 0, . . . , k n — 1. It follows that for all n € 7 C 

x n+i = y n 

For n € I we can apply Lemma I5TI to obtain the following inequalities. In (1231 , by letting ip — f, 

4> = g, u = x* and it = .t™ +1 , we get p^(v) — y n+1 , $ = F and 

2 M (F(z*) - F(y n+1 )) > \\y n+1 - x*\\ 2 - \\x n+1 - x*\\ 2 . (29) 
Similarly, by letting tp = g, tj> = f, u = x* and v = y n in ( 1231 we getp g (v) — x n+1 , $ = F and 
2fi(F(x*) - F(x n+1 )) > \\x n+1 - x*\\ 2 - \\y n - a;*|| 2 . (30) 
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Taking the summation of d29l ) and (f30b we get 

2 M (2F(x*) - F(x n+1 ) - F(2/ n+1 )) > \\y n+1 - x*f ~ \\y n - x*\\ 2 . (31) 



For n G I c , (1291 1 holds, and we get 

2/i(F(z*) - > ||y" +1 - x*\\ 2 - \\y n - x*\\ 2 , (32) 

due to the fact that x n+1 — y" in this case. 
Summing d3TT > and d32l over ra = 0, 1, . . . , k — 1 we get 

fc-i 

2 M ((2|7| + \I c \)F(x*) - F(* n+1 ) E < 33 > 

k — 1 

>E(lly" +1 -^ll 2 -lly"-^ll 2 ) 

?i=0 

=lh/-z*ll 2 -||y -z*l! 2 

>- ||z°-z*|| 2 . 

For any n, since Lemma lBTTl holds for any u, letting u = x n+1 instead of x* we get from d2"9l that 

2fi{F{x n+1 ) - F(y n+1 )) > \\y n+1 - x n+1 \\ 2 > 0, (34) 

or, equivalently, 

2n(F{x n ) - F(y n )) >\\y n - x n \\ 2 > 0. (35) 

Similarly, for n £ I by letting u = y n instead of x* we get from ( f30b that 

2f,(F(y n ) - F(x n+1 )) > \\x n+1 - y n f > 0. (36) 
On the other hand, for n 6 I c , d36b also holds because x n+1 = y n , and hence holds for all n. 
Adding (134-b and (|36*] | we obtain 

2fx(F(y n ) - F(y n+1 )) > 0. (37) 
and adding d35l l and (f36t we obtain 

2[i(F(x n ) - F(x n+1 )) > 0. (38) 
( l37| i and (1381 show that the sequences F(y n ) and F(x n ) are non-increasing. Thus we have, 

fc-i 

E F (y" +1 ) > ^(/) and ^2F(x n+1 )>k n F(x k ). (39) 

n=0 n£7 

Combining (1331 and ( 1391 yields 

2/i ((fc + fc„)F(a;*) - k n F(x k ) - kF(y k )) > -\\x a - x*\\ 2 . (40) 
From <[35j we know that F{x k ) > F{y k ). Thus (@0]l implies that 

2fi(k + k n ) (F(y k ) - F(x*)) <\\x°- x*\\ 2 , 
which gives us the desired result fTTV 

Also, for any given e > 0, as long as k > L ^^2p7 X ^ ' we nave from (TPT1 that F(y k ) — F(a;*) < e; 
i.e., the number of iterations needed is 0(1/ e) for an e-optimal solution. □ 
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